DDSAnalytics is an analytics company that specializes in talent management solutions for Fortune 100 companies. Talent management is defined as the iterative process of developing and retaining employees. It may include workforce planning, employee training programs, identifying high-potential employees and reducing/preventing voluntary employee turnover (attrition). To gain a competitive edge over its competition, DDSAnalytics is planning to leverage data science for talent management. The executive leadership has identified predicting employee turnover as its first application of data science for talent management. Before the business green lights the project, they have tasked your data science team to conduct an analysis of existing employee data.
In this report I analyze the dataset CaseStudy2-data.csv to identify the factors that lead to attrition. I identify the top three factors that contribute to turnover, backed by evidence from the analysis. Derived attributes/variables/features may or may not be needed. The business is also interested in any job-role-specific trends in the data set (e.g., “Data Scientists have the highest job satisfaction”), and I report other interesting trends and observations as well. The analysis is supported by experimentation and appropriate visualization, all conducted in R. Finally, I build a model to predict attrition.
library(tidyverse) #The "tidyverse" collects some of the most versatile R packages: ggplot2, dplyr, tidyr, readr, purrr, and tibble. The packages work in harmony to clean, process, model, and visualize data.
library(skimr) #compact summary statistics for every column via skim()
library(mice) #provides md.pattern() for inspecting the pattern of missing data
library(VIM) #provides aggr() for a visual summary of missing data
library(naniar) #https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html (for gg_miss_var) (Missing values)
library(mlbench) #collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository. (also has BostonHousing)
library(caret)
library(mlr)
library(ggthemes)
library(gplots)
library(randomForest)
library(corrplot)
library(kableExtra)
library(plotly)
library(GGally) #for ggpairs
rawdata <- read.csv("https://github.com/hnguye01/hnguye01.github.io/raw/master/DS6306/Data/CaseStudy2-data.csv")
head(rawdata)
## ID Age Attrition BusinessTravel DailyRate Department
## 1 1 32 No Travel_Rarely 117 Sales
## 2 2 40 No Travel_Rarely 1308 Research & Development
## 3 3 35 No Travel_Frequently 200 Research & Development
## 4 4 32 No Travel_Rarely 801 Sales
## 5 5 24 No Travel_Frequently 567 Research & Development
## 6 6 27 No Travel_Frequently 294 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 13 4 Life Sciences 1 859
## 2 14 3 Medical 1 1128
## 3 18 2 Life Sciences 1 1412
## 4 1 4 Marketing 1 2016
## 5 2 1 Technical Degree 1 1646
## 6 10 2 Life Sciences 1 733
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 73 3 2
## 2 3 Male 44 2 5
## 3 3 Male 60 3 3
## 4 3 Female 48 3 3
## 5 1 Female 32 3 1
## 6 4 Male 32 3 3
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Divorced 4403
## 2 Research Director 3 Single 19626
## 3 Manufacturing Director 4 Single 9362
## 4 Sales Executive 4 Married 10422
## 5 Research Scientist 4 Single 3760
## 6 Manufacturing Director 1 Divorced 8793
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 9250 2 Y No 11
## 2 17544 1 Y No 14
## 3 19944 2 Y No 11
## 4 24032 1 Y No 19
## 5 17218 1 Y Yes 13
## 6 4809 1 Y No 21
## PerformanceRating RelationshipSatisfaction StandardHours
## 1 3 3 80
## 2 3 1 80
## 3 3 3 80
## 4 3 3 80
## 5 3 3 80
## 6 4 3 80
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1 1 8 3 2
## 2 0 21 2 4
## 3 0 10 2 3
## 4 2 14 3 3
## 5 0 6 2 3
## 6 2 9 4 2
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1 5 2 0
## 2 20 7 4
## 3 2 2 2
## 4 14 10 5
## 5 6 3 1
## 6 9 7 1
## YearsWithCurrManager
## 1 3
## 2 9
## 3 2
## 4 7
## 5 3
## 6 7
view(rawdata) #There are 870 entries, 36 total columns
length(rawdata) #[1] 36
## [1] 36
skim(rawdata) #summary of every column, including missing-value counts
| Name | rawdata |
| Number of rows | 870 |
| Number of columns | 36 |
| _______________________ | |
| Column type frequency: | |
| factor | 9 |
| numeric | 27 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 730, Yes: 140 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 618, Tra: 158, Non: 94 |
| Department | 0 | 1 | FALSE | 3 | Res: 562, Sal: 273, Hum: 35 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 358, Med: 270, Mar: 100, Tec: 75 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 516, Fem: 354 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 200, Res: 172, Lab: 153, Man: 87 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 410, Sin: 269, Div: 191 |
| Over18 | 0 | 1 | FALSE | 1 | Y: 870 |
| OverTime | 0 | 1 | FALSE | 2 | No: 618, Yes: 252 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 435.50 | 251.29 | 1 | 218.25 | 435.5 | 652.75 | 870 | ▇▇▇▇▇ |
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.00 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.50 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▅▇▆▁ |
| EmployeeCount | 0 | 1 | 1.00 | 0.00 | 1 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▇▁▁ |
| EmployeeNumber | 0 | 1 | 1029.83 | 604.79 | 1 | 477.25 | 1039.0 | 1561.50 | 2064 | ▇▇▇▇▇ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▆▁▇▇ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.00 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| JobInvolvement | 0 | 1 | 2.72 | 0.70 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.71 | 1.11 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.50 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.00 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.00 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StandardHours | 0 | 1 | 80.00 | 0.00 | 80 | 80.00 | 80.0 | 80.00 | 80 | ▁▁▇▁▁ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.78 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.00 | 5.0 | 10.00 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
So the dataset has 870 observations and 36 variables.
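As a small aside, `length()` on a data frame returns only the number of columns, while `nrow()`, `ncol()` and `dim()` report both dimensions. Here is a minimal sketch on a toy data frame (the `toy` object is hypothetical and not part of the case-study data):

```r
# Hypothetical toy data frame, used only to illustrate the calls
toy <- data.frame(
  id = 1:4,
  age = c(32, 40, 35, 24),
  attrition = c("No", "No", "Yes", "No")
)

length(toy)  # number of columns (a data frame is a list of columns)
nrow(toy)    # number of rows
ncol(toy)    # number of columns
dim(toy)     # rows first, then columns
```

Calling `dim(rawdata)` in the same way would confirm both counts in a single step.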
The skim(rawdata) output already shows that no variable has missing values (n_missing is 0 throughout). For reference, here are three other ways to check for missing data. Running any one of them is sufficient.
md.pattern(rawdata)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## ID Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## 870 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## Education EducationField EmployeeCount EmployeeNumber
## 870 1 1 1 1
## 0 0 0 0
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 870 1 1 1 1 1
## 0 0 0 0 0
## JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 870 1 1 1 1 1
## 0 0 0 0 0
## NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 870 1 1 1 1 1
## 0 0 0 0 0
## RelationshipSatisfaction StandardHours StockOptionLevel
## 870 1 1 1
## 0 0 0
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 870 1 1 1 1
## 0 0 0 0
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 870 1 1 1 0
## 0 0 0 0
aggr_plot <- aggr(rawdata, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(rawdata), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## ID 0
## Age 0
## Attrition 0
## BusinessTravel 0
## DailyRate 0
## Department 0
## DistanceFromHome 0
## Education 0
## EducationField 0
## EmployeeCount 0
## EmployeeNumber 0
## EnvironmentSatisfaction 0
## Gender 0
## HourlyRate 0
## JobInvolvement 0
## JobLevel 0
## JobRole 0
## JobSatisfaction 0
## MaritalStatus 0
## MonthlyIncome 0
## MonthlyRate 0
## NumCompaniesWorked 0
## Over18 0
## OverTime 0
## PercentSalaryHike 0
## PerformanceRating 0
## RelationshipSatisfaction 0
## StandardHours 0
## StockOptionLevel 0
## TotalWorkingYears 0
## TrainingTimesLastYear 0
## WorkLifeBalance 0
## YearsAtCompany 0
## YearsInCurrentRole 0
## YearsSinceLastPromotion 0
## YearsWithCurrManager 0
gg_miss_var(rawdata, show_pct = TRUE) + labs(title = "Percent missing of the data") + theme(legend.position = "none", plot.title = element_text(hjust = 0.5), axis.title.y = element_text(angle = 0, vjust = 1))
So the dataset has no missing data.
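If none of these packages is available, the same check can be done in base R. This is a minimal sketch on a hypothetical toy data frame (the `toy`, `na_per_column` and `no_missing` names are illustrative assumptions, not part of the original analysis):

```r
# Hypothetical toy data with two missing values, for illustration only
toy <- data.frame(
  Age = c(32, NA, 35, 24),
  OverTime = c("No", "Yes", NA, "No"),
  JobLevel = c(2, 5, 3, 1)
)

# Count missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE only when the whole data frame is completely observed
no_missing <- !anyNA(toy)
print(no_missing)
```

Applied to `rawdata`, `colSums(is.na(rawdata))` should give a zero for every column if the data set is complete, as the outputs above suggest.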
The skim() and view() output shows that some columns have no variation, so we can drop them without affecting the analysis: Over18 is Y for all 870 observations, EmployeeCount is 1 for all 870 observations, and StandardHours is 80 for all 870 observations. (Every employee being over 18 is unsurprising, and 80 standard hours is high for one week, so the figure may be per two-week pay period.) We therefore drop these three columns.
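#Flag the columns that contain a single unique value (no variation)
drop_columns <- sapply(rawdata, function(x) length(unique(x)) == 1)
cols <- names(rawdata)[drop_columns] #names of the columns being dropped
#Logical indexing keeps every column when nothing is flagged (a negative index on an empty set would drop them all)
rawdata <- rawdata[, !drop_columns]
#Alternatively, drop the columns by name: rawdata <- select(rawdata, -c("Over18","EmployeeCount", "StandardHours")). The result is the same.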
skim(rawdata)
| Name | rawdata |
| Number of rows | 870 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| factor | 8 |
| numeric | 25 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 730, Yes: 140 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 618, Tra: 158, Non: 94 |
| Department | 0 | 1 | FALSE | 3 | Res: 562, Sal: 273, Hum: 35 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 358, Med: 270, Mar: 100, Tec: 75 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 516, Fem: 354 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 200, Res: 172, Lab: 153, Man: 87 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 410, Sin: 269, Div: 191 |
| OverTime | 0 | 1 | FALSE | 2 | No: 618, Yes: 252 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 435.50 | 251.29 | 1 | 218.25 | 435.5 | 652.75 | 870 | ▇▇▇▇▇ |
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.00 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.50 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▅▇▆▁ |
| EmployeeNumber | 0 | 1 | 1029.83 | 604.79 | 1 | 477.25 | 1039.0 | 1561.50 | 2064 | ▇▇▇▇▇ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▆▁▇▇ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.00 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| JobInvolvement | 0 | 1 | 2.72 | 0.70 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.71 | 1.11 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.50 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.00 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.00 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.78 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.00 | 5.0 | 10.00 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
Running skim() again confirms that all three columns have been dropped.
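Zero-variance columns can also be removed in a dplyr pipeline. The sketch below uses `select()` with `where()` and `n_distinct()` on a hypothetical toy data frame; the `toy` and `toy_varying` names are assumptions made for illustration:

```r
library(dplyr)

# Hypothetical toy data: Over18 and StandardHours never vary
toy <- data.frame(
  Age = c(32, 40, 35),
  Over18 = c("Y", "Y", "Y"),
  StandardHours = c(80, 80, 80),
  OverTime = c("No", "Yes", "No")
)

# Keep only the columns with more than one distinct value
toy_varying <- toy %>% select(where(~ n_distinct(.x) > 1))
names(toy_varying)
```

Because the selection keeps the columns that vary instead of removing the ones that do not, a data frame without any constant column passes through unchanged.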
I also drop the columns ID and EmployeeNumber. They only identify individual employees, so they are unrelated to salary or attrition and not useful for the analysis. After dropping them, I run skim() again to check the dataset.
rawdata <- select(rawdata, -c("ID","EmployeeNumber"))
skim(rawdata)
| Name | rawdata |
| Number of rows | 870 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| factor | 8 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 730, Yes: 140 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 618, Tra: 158, Non: 94 |
| Department | 0 | 1 | FALSE | 3 | Res: 562, Sal: 273, Hum: 35 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 358, Med: 270, Mar: 100, Tec: 75 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 516, Fem: 354 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 200, Res: 172, Lab: 153, Man: 87 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 410, Sin: 269, Div: 191 |
| OverTime | 0 | 1 | FALSE | 2 | No: 618, Yes: 252 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.0 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.5 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.0 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.0 | 3.0 | 4.00 | 5 | ▂▅▇▆▁ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▅▆▁▇▇ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.0 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| JobInvolvement | 0 | 1 | 2.72 | 0.70 | 1 | 2.0 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.0 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.71 | 1.11 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.5 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.0 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.0 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.0 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.0 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.10 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.0 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.0 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.0 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.78 | 0.71 | 1 | 2.0 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.0 | 5.0 | 10.00 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.0 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.0 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.0 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
We now have 31 columns in the dataset.
Next, I convert the following five numeric variables to factor variables.
factorcolumns <- c("JobInvolvement", "JobSatisfaction", "PerformanceRating", "RelationshipSatisfaction", "WorkLifeBalance")
rawdata[,factorcolumns] <- lapply(rawdata[,factorcolumns], as.factor)
data0 <- rawdata #data0 - dataset that I use for the analysis
skim(data0)
| Name | data0 |
| Number of rows | 870 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| factor | 13 |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 730, Yes: 140 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 618, Tra: 158, Non: 94 |
| Department | 0 | 1 | FALSE | 3 | Res: 562, Sal: 273, Hum: 35 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 358, Med: 270, Mar: 100, Tec: 75 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 516, Fem: 354 |
| JobInvolvement | 0 | 1 | FALSE | 4 | 3: 514, 2: 228, 4: 81, 1: 47 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 200, Res: 172, Lab: 153, Man: 87 |
| JobSatisfaction | 0 | 1 | FALSE | 4 | 4: 271, 3: 254, 1: 179, 2: 166 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 410, Sin: 269, Div: 191 |
| OverTime | 0 | 1 | FALSE | 2 | No: 618, Yes: 252 |
| PerformanceRating | 0 | 1 | FALSE | 2 | 3: 738, 4: 132 |
| RelationshipSatisfaction | 0 | 1 | FALSE | 4 | 4: 264, 3: 261, 1: 174, 2: 171 |
| WorkLifeBalance | 0 | 1 | FALSE | 4 | 3: 532, 2: 192, 4: 98, 1: 48 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 36.83 | 8.93 | 18 | 30.0 | 35.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 815.23 | 401.12 | 103 | 472.5 | 817.5 | 1165.75 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.34 | 8.14 | 1 | 2.0 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.90 | 1.02 | 1 | 2.0 | 3.0 | 4.00 | 5 | ▂▅▇▆▁ |
| EnvironmentSatisfaction | 0 | 1 | 2.70 | 1.10 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▅▆▁▇▇ |
| HourlyRate | 0 | 1 | 65.61 | 20.13 | 30 | 48.0 | 66.0 | 83.00 | 100 | ▇▇▆▇▇ |
| JobLevel | 0 | 1 | 2.04 | 1.09 | 1 | 1.0 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| MonthlyIncome | 0 | 1 | 6390.26 | 4597.70 | 1081 | 2839.5 | 4945.5 | 8182.00 | 19999 | ▇▅▂▁▁ |
| MonthlyRate | 0 | 1 | 14325.62 | 7108.38 | 2094 | 8092.0 | 14074.5 | 20456.25 | 26997 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.73 | 2.52 | 0 | 1.0 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.20 | 3.68 | 11 | 12.0 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| StockOptionLevel | 0 | 1 | 0.78 | 0.86 | 0 | 0.0 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.05 | 7.51 | 0 | 6.0 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.83 | 1.27 | 0 | 2.0 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| YearsAtCompany | 0 | 1 | 6.96 | 6.02 | 0 | 3.0 | 5.0 | 10.00 | 40 | ▇▃▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.20 | 3.64 | 0 | 2.0 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.17 | 3.19 | 0 | 0.0 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.14 | 3.57 | 0 | 2.0 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
We now have 13 factor columns and 18 numeric columns in the dataset.
In the next part, I carry out exploratory data analysis (EDA). First, I analyze each variable in the dataset.
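The conversion above uses `as.factor()`, which produces unordered factors (the `ordered` column in the skim output is FALSE). If the 1 to 4 codes are treated as an ordinal scale, which is an assumption and not something the data set itself states, an ordered factor is one possible alternative. This is a minimal sketch with hypothetical values:

```r
# Hypothetical ratings on a 1-4 scale, for illustration only
ratings <- c(4, 3, 1, 4, 2)

# Plain (unordered) factor, as produced by as.factor()
f_plain <- as.factor(ratings)

# Ordered factor: levels carry an explicit 1 < 2 < 3 < 4 ordering
f_ordered <- factor(ratings, levels = 1:4, ordered = TRUE)

levels(f_plain)
is.ordered(f_plain)
is.ordered(f_ordered)
f_ordered[1] > f_ordered[3]  # comparisons are defined for ordered factors
```

Whether the ordering matters depends on the model used later; many modeling functions encode ordered and unordered factors differently, so the choice is worth making deliberately.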